Indonesian Journal of Electrical Engineering and Computer Science 
Vol. 13, No. 2, February 2019, pp. 729~736 
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v13.i2.pp729-736


A noble approach to develop dynamically scalable namenode in 
hadoop distributed file system using secondary storage 


Tumpa Rani Shaha1, Md. Nasim Akhtar2, Fatema Tuj Johora3, Md. Zakir Hossain4, Mostafijur Rahman5, R. B. Ahmad6

1,2,4Department of Computer Science and Engineering, Dhaka University of Engineering and Technology (DUET), Gazipur, Bangladesh
3Institute of Information Technology, Jahangirnagar University (JU), Bangladesh
Department of Software Engineering, Daffodil International University (DIU), Dhaka, Bangladesh
Department of Computer Science and Engineering, Daffodil International University (DIU), Dhaka, Bangladesh
6Faculty of Informatics and Computing, University Sultan Zainal Abidin (UniSZA), 22200 Besut, Terengganu, Malaysia


Article Info

Article history:
Received Nov 12, 2018
Revised Dec 13, 2018
Accepted Dec 27, 2018

Keywords:
DataNode
Hadoop
Metadata
NameNode
Secondary Storage

ABSTRACT

Hadoop is widely used for scalable data storage. It provides a distributed file system that stores
data on the compute nodes. It follows a master/slave architecture consisting of a NameNode and many
DataNodes. The DataNodes hold application data, while the metadata of that data resides in the main
memory of the NameNode. The existing cached approach fragments the metadata according to last access
time and moves the least frequently used data to secondary memory. If requested data is not found in
main memory, the secondary data is loaded back into RAM, so the NameNode main memory limitation
arises again. The focus of this research is to reduce the namespace problem of main memory and to
make the system dynamically scalable. A new Metadata Fragmentation Algorithm is proposed that
separates the metadata list of the NameNode dynamically. The NameNode creates a Secondary Memory File
based on a threshold value and allocates secondary memory locations as required. Under the proposed
algorithm, at most three quarters of main memory is used during secondary file caching. The free
space aids faster operation in the Dynamically Scalable NameNode approach. The proposed algorithm
improves space utilization by 17% and time utilization by 0.0005% compared with the existing
fragmentation algorithm.


1. INTRODUCTION

In this modern age, handling the data generated every day has become a major concern. Approximately
2.5 quintillion bytes of data are created every day, and 90% of the data has been created in the last
two years. These data are generated everywhere: sensors gathering climate information, social media
sites, transaction records, satellites, etc. The data sets are largely unstructured, so processing
and analyzing such big data is a great challenge. As data sizes have grown enormously, RDBMSs have
struggled to cope. Moreover, because these data sets are semi-structured and unstructured, an RDBMS
cannot organize them, as it is designed to handle structured data. This problem requires a data
management system capable of analyzing these data in an efficient and convenient way. Apache Hadoop
is a system of this kind for handling semi-structured and unstructured data; it provides an open
source, distributed data processing platform across several thousand nodes. Moreover, it is fast,
fault tolerant and cost efficient, as it stores data in small pieces across multiple servers.

File system metadata and application data are stored separately in the existing Hadoop Distributed
File System. The NameNode holds the metadata of the system and the DataNodes hold application data.
Tens of thousands of clients per cluster can access the Hadoop storage at a time. Each DataNode
stores blocks of data in its local file system, and the NameNode stores the metadata of all
DataNodes in its local file system. If we try to extend the network by adding new DataNodes, the
NameNode main memory limitation prevents it. This namespace limitation is one of the important
problems of the existing Hadoop Distributed File System.

The proposed Dynamically Scalable NameNode (DSN) approach introduces a Metadata Fragmentation
Algorithm (MFA) that fragments the metadata frequently and increases the namespace capacity
dynamically through interaction between the main memory and secondary memory of the NameNode.


2. LITERATURE REVIEW 

In the field of modern technology, the use of Hadoop for handling big data has become an active
area of research. Several approaches have been suggested in this field.

In [1], the authors developed the Hadoop Distributed File System for Yahoo. They explained the
Hadoop architecture and reported results for handling 25 petabytes of data at Yahoo.

A classification-based metadata management system is proposed in [2]. The authors focused on
reducing the bottleneck of the NameNode main memory. They fragment the metadata of the NameNode based
on an importance factor (If), for which they define three levels (High, Medium and Low). A hash
table is used to represent high If, a tree map is used to represent medium If and sequence files are
used to represent low If.

A cached approach is proposed in [3] for addressing NameNode scalability in HDFS. Its main focus
was to enhance the existing architecture. The authors fragment the metadata according to last access
time and move the least frequently used data to a cache. They were able to remove 250 MB of data
from RAM. During a search, however, when the requested data is not found in main memory, the
secondary data is loaded back into RAM. Once the secondary data is reloaded into primary memory, the
NameNode main memory limitation arises again.

In [4], the authors analyze requirements such as hardware, software and network environment for
improving the performance of cloud computing. They developed a layered cache system consisting of a
client library and multiple cache services. The client library can access files from shared memory.
This distributed cache system can handle a large number of files with millisecond-level latency in a
highly concurrent environment.

In [5], the authors developed a mechanism to improve Hadoop performance using metadata for handling
big data. By assigning jobs to the DataNodes, H2Hadoop extended the ability of the NameNode. They
succeeded in reducing CPU time and the number of read operations.

In [6-8], the authors proposed systems for improving metadata management in HDFS for small files.
They focused on the small files in main memory and provided archival methods for those files.

A distributed metadata management scheme for HDFS is proposed in [9] to improve HDFS efficiency.

In [10], the namespace is divided into several fragments. Replicas of each fragment are dispersed
among the NameNodes (NN). Metadata searching with synchronization takes more time because the
fragmented namespaces are distributed among different NN.

In [11], the authors proposed a Dynamic Directory Partitioning (DDP) technique that handles
directory metadata and file metadata in different ways. They improved scalability and adaptability.

An efficient metadata management system based on directory levels has also been proposed, which is
more efficient than directory subtree partitioning and the traditional hashing technique.


3. RESEARCH METHOD 

The DSN methodology has two design elements: (1) the Dynamically Scalable NameNode architecture and
(2) the working procedure. This section presents the system architecture and the working procedure
of DSN.


3.1. Dynamically Scalable NameNode Architecture

The Dynamically Scalable NameNode architecture is shown in Figure 1. DSN has a master/slave
architecture. A DSN cluster consists of a single NameNode, a master server that manages the file
system namespace and regulates access to files by clients, and a number of DataNodes, usually one per
node in the cluster. Overall, the DSN system consists of one NameNode, a group of DataNodes, clients,
and the main memory and secondary memory concepts discussed in this section.

Figure 1. Dynamically Scalable NameNode Architecture





3.2. NameNode 

The focal point of HDFS is the NameNode. It keeps track of where file data is kept across the
cluster. The directory tree of all files in the system is also kept here. When clients wish to locate
a file or need to add, copy, delete or move a file, the client application sends a request to the
NameNode. The NameNode replies with the corresponding DataNode address.


3.3. Main Memory 

Normally the namespace of the Hadoop system is stored in the NameNode main memory. In the proposed
architecture we introduce the Main Memory Metadata File (MMF) concept, which stores the high priority
metadata of the system.


3.4. Secondary Memory 

The proposed architecture introduces the secondary memory concept. The fragmented low priority
metadata is stored in secondary memory. Many files can be stored in secondary memory according to
the proposed algorithm, which is discussed in the working procedure section.


3.5. DataNode 

DataNodes store the data in HDFS. A DataNode communicates with the NameNode to perform the data
modifications commanded by the NameNode, and it reports to the NameNode at fixed time intervals with
a list of the blocks it is storing. Client systems can communicate with a DataNode directly once the
NameNode has supplied the address of that DataNode.


3.6. Clients 

Clients of the proposed system can request any particular file from the NameNode. The NameNode
replies to the client with the address of the DataNode holding the requested file. The client then
communicates directly with the DataNode for read or write operations.


3.7. Metadata 

HDFS metadata is divided into two categories of files, named fsimage and the edits log. An fsimage
file contains the complete state of the file system at a point in time. Every modification of the
file system is assigned a unique, increasing transaction id. An fsimage file represents the file
system state after all modifications up to a given transaction id.
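
As a rough illustration of how a checkpoint and a transaction log relate, the sketch below replays log entries on top of a snapshot. It is a toy model only: the dictionary layout, the `replay` function and the `"create"`/`"delete"` operation names are assumptions made for this example and do not reflect the on-disk HDFS formats.

```python
def replay(fsimage, edits):
    """Apply edit-log entries newer than the checkpoint to a copy of the namespace."""
    state = dict(fsimage["namespace"])   # path -> list of block ids (toy model)
    last_txid = fsimage["txid"]          # last transaction reflected in the snapshot
    for txid, op, path in sorted(edits):
        if txid <= last_txid:
            continue                     # already reflected in the checkpoint
        if op == "create":
            state[path] = []             # new file with no blocks yet
        elif op == "delete":
            state.pop(path, None)
        last_txid = txid
    return {"txid": last_txid, "namespace": state}
```

Replaying the same log twice gives the same result, because entries at or below the checkpoint's transaction id are skipped.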


3.8. Working Procedure

In the DSN architecture, when a Hadoop client sends a request to the NameNode, the NameNode first
checks the available space in the MMF (Main Memory Metadata File). If space is available, the new
file is created and the default priority 1 (lowest priority) is set for it. If there is no available
space in the MMF, the lowest priority metadata is moved to an SMF (Secondary Memory Metadata File)
following the proposed Metadata Fragmentation Algorithm (MFA). That is, a priority-based dynamic
metadata classifier is proposed for main memory utilization. For assigning priority, let us assume
the following parameters:


Tq = Fixed Time Interval
H = Number of Hits during Tq
MMS = Main Memory Size
Mth = Main Memory Threshold
Sth = Secondary Memory Threshold
MMF = Main Memory Metadata File
SMF = Secondary Memory Metadata File
S = Size of each metadata
x = Number of metadata files for Mth = (MMS/2)/S
y = Number of metadata files for Sth = (MMS/4)/S


Generally, the full fsimage file is stored in the main memory of the NameNode. To fragment the
fsimage file, the threshold value (Mth) is calculated as MMS/2; that is, half of the main memory
size is the threshold for the MMF. The Secondary Memory Threshold (Sth) is calculated as MMS/4. So x
is the number of metadata files that can be stored within Mth and y is the number of metadata files
that can be stored within Sth. Figure 2 shows the metadata fragmentation algorithm.


1. If MMF > Mth then
2.     Calculate new Priority value (P) = Average(Old Priority, H)
3.     Sort the metadata depending on P in descending order
4.     Keep high order x factor of data in MMF
5.     Shift rest lowest data to SMF[i] [i = 1....n]
6.     If SMF[i] > Sth then
7.         Repeat step 2 & 3
8.         Keep high order y factor of data in SMF[i]
9.         Shift rest low factor data to SMF[i+1]
10.    end if
11. end if





Figure 2. Metadata Fragmentation Algorithm 
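
The steps of Figure 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' C++ implementation: the `Meta` record, the `rank` and `fragment` names, the use of record counts (x and y) in place of byte sizes, the reset of the hit counter after each ranking, and the choice to spill overflow into the first SMF and cascade onward are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Meta:
    name: str
    priority: float = 1.0  # default (lowest) priority for a new record
    hits: int = 0          # H: hits counted during the current interval


def rank(records):
    """Steps 2-3: P = average(old priority, H), then sort by P descending."""
    for r in records:
        r.priority = (r.priority + r.hits) / 2
        r.hits = 0  # assumption: the hit counter restarts each interval
    return sorted(records, key=lambda r: r.priority, reverse=True)


def fragment(mmf, smfs, x, y):
    """Apply the fragmentation steps once; x and y are capacities in records."""
    if len(mmf) <= x:                      # step 1: MMF has not exceeded Mth
        return mmf, smfs
    ranked = rank(mmf)
    mmf, overflow = ranked[:x], ranked[x:]  # steps 4-5
    smfs = [list(f) for f in smfs] or [[]]
    smfs[0].extend(overflow)               # assumption: overflow enters the first SMF
    i = 0
    while len(smfs[i]) > y:                # step 6: SMF[i] has exceeded Sth
        ranked = rank(smfs[i])             # step 7
        smfs[i], spill = ranked[:y], ranked[y:]  # step 8
        if i + 1 == len(smfs):
            smfs.append([])                # create the next secondary file on demand
        smfs[i + 1].extend(spill)          # step 9
        i += 1
    return mmf, smfs
```

With x = 2 and y = 2, five records whose hit counts are 5, 4, 3, 2 and 1 leave the two most-hit records in the MMF, the next two in the first SMF, and the last one in a newly created second SMF.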


When the size of the metadata file exceeds Mth, the fragmentation algorithm is triggered. Once it
is triggered, the priority value of each metadata record is updated. The newly generated priority
values are sorted from higher to lower, and the metadata with higher priority is kept in the MMF.
That is, x metadata records are stored in the MMF.

Low priority metadata records are separated out and moved into a file created on secondary storage.
Because low priority metadata frequently moves to secondary storage, the number of SMFs grows with
the size of the metadata. The number of metadata records stored in each SMF is given by factor y, and
they must be stored in order from higher to lower priority. Suppose the size of main memory is 1 GB;
then the threshold value (Mth) is 512 MB and the size of each fragmented file in secondary memory
(Sth) is 1 GB/4 = 256 MB. If the size of each metadata record is 1 MB, then the MMF can contain 512
metadata records, which is factor x.
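
The arithmetic in this example can be checked with a few lines of Python. The function name `thresholds` and the choice of megabytes as the unit are assumptions for illustration; the formulas are the ones given in the parameter list (Mth = MMS/2, Sth = MMS/4, x = Mth/S, y = Sth/S).

```python
def thresholds(mms_mb, s_mb):
    """Return (Mth, Sth, x, y) for main memory size MMS and record size S, in MB."""
    mth = mms_mb / 2          # Mth = MMS / 2
    sth = mms_mb / 4          # Sth = MMS / 4
    x = int(mth // s_mb)      # records that fit within Mth
    y = int(sth // s_mb)      # records that fit within each SMF
    return mth, sth, x, y


print(thresholds(1024, 1))  # (512.0, 256.0, 512, 256)
```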

When the user searches for a particular file, the system searches main memory first. If the file is
found, the NameNode replies to the user with the DataNode address. If it is not found in main memory,
then according to the priority value the requested file is cached into main memory from secondary
memory through a page table, as shown in Figure 3.


[Figure 3 diagram labels: Address Translation, Page Table, Memory Resident, Secondary Memory, SMFn]





Figure 3. Secondary File Caching 
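
The lookup path described above (main memory first, then secondary memory) can be sketched as follows. This is a hedged illustration only: the `NameNodeIndex` class, the use of dictionaries in place of metadata files, and the `resident` index standing in for the page table entry are hypothetical, and the sketch assumes that at most one SMF is memory resident at a time.

```python
class NameNodeIndex:
    def __init__(self, mmf, smfs):
        self.mmf = mmf            # dict: file name -> DataNode address (main memory)
        self.smfs = smfs          # list of dicts standing in for files on disk
        self.resident = None      # index of the SMF currently cached in memory

    def lookup(self, name):
        """Return the DataNode address for name, or None if it is unknown."""
        if name in self.mmf:                       # hit in the MMF
            return self.mmf[name]
        if self.resident is not None and name in self.smfs[self.resident]:
            return self.smfs[self.resident][name]  # hit in the cached SMF
        for i, smf in enumerate(self.smfs):        # miss: scan secondary files
            if name in smf:
                self.resident = i                  # replace the previously cached SMF
                return smf[name]
        return None
```

Keeping a single resident SMF is a design choice made here so that main memory holds the MMF plus at most one secondary file.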


4. RESULTS AND ANALYSIS 

To evaluate the performance of the MFA, we conducted two kinds of tests: (1) performance on main
memory usage and (2) performance on average response time. This section demonstrates the performance
of the DSN approach and compares it with the existing cached approach.


4.1. Simulation Platform 

We implemented the MFA and the existing fragmentation algorithm in C++ on two different computers:
one with 4 GB RAM and a 2.10 GHz Core i3 processor, the other with 8 GB RAM and a 1.60 GHz Core i5
processor.


4.2. Performance on Main Memory Usages 

This section discusses the main memory usage of the DSN approach and the existing cached approach
in terms of the size of main memory. Figure 4 shows the NameNode main memory usage comparison.

Figure 4. NameNode Main Memory Usage Comparison




According to the MFA, RAM usage for the Dynamically Scalable NameNode approach is determined by the
x factor, the y factor and the size of each metadata record. After fragmentation in the cached
approach, with 1 GB of RAM the main memory can store 700 MB of metadata and 250 MB of data is held in
secondary memory [3]. In the DSN system, by contrast, the main memory holds 512 MB and each secondary
memory file holds 256 MB after the metadata fragmentation algorithm is triggered. The secondary
memory can store several files of size 256 MB, so the storage capacity is increased dynamically.

The existing cached approach uses 92% of RAM, whereas the DSN algorithm requires at most 75% of
main memory in the worst case. The DSN approach therefore reduces main memory usage by about 17
percentage points. This free space in main memory benefits the overall response time of the
NameNode.
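
One way to read the 75% worst-case figure is that the MMF is full (Mth = MMS/2) while one SMF of size Sth = MMS/4 is cached in main memory. The short check below works through that reading; the function name `worst_case_usage` is hypothetical and the interpretation is an assumption drawn from the thresholds defined earlier, while the 92% value is the figure reported above for the cached approach.

```python
def worst_case_usage(mms_mb):
    """Fraction of main memory in use when the MMF is full and one SMF is cached."""
    mth = mms_mb / 2      # MMF capacity, Mth = MMS / 2
    sth = mms_mb / 4      # one cached secondary file, Sth = MMS / 4
    return (mth + sth) / mms_mb


dsn = worst_case_usage(1024) * 100   # DSN worst case, in percent
cached = 92.0                        # reported usage of the cached approach
print(dsn, cached - dsn)             # 75.0 17.0
```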


4.3. Performance on Response Time 

This section discusses the average response time of the DSN approach and the existing cached
approach. To analyze the average response time of the NameNode, we simulated the proposed MFA and the
existing algorithm on two computers. Setup-1: 4 GB RAM with a 2.10 GHz Core i3 processor; Setup-2:
8 GB RAM with a 1.60 GHz Core i5 processor. Figure 5 and Figure 6 show the average response time
analysis of Setup-1 and Setup-2.

Figure 5. Average Response Time Analysis of Setup-1


Let the size of each metadata record (S) be 1 MB. Then the MMF will contain 512 metadata records,
which is factor x, and each SMF can contain 256 metadata records, which is factor y.

Figure 6. Average Response Time Analysis of Setup-2

In the simulation, the configuration of Setup-2 is higher than that of Setup-1, and the average
response time in Setup-2 is lower than in Setup-1. This indicates that the proposed system provides
better response time on a more highly configured system.


5. CONCLUSION 

In this work we have experimented with a large amount of data; time requirements were reduced and
memory utilization was improved. Our performance evaluation indicates that the proposed system is
more efficient than the existing cached approach. By using secondary storage, the amount of metadata
in main memory does not grow so large that the NameNode becomes unresponsive. At the same time,
client requests can be handled more promptly than in the existing system. In future work we would
like to introduce several additional parameters and a mathematical proof so that the system can work
more efficiently and can be implemented in real-time systems.